Import the needed packages
library(tidyverse)
library(openxlsx)
library(stringr)
library(plyr) # Note: loaded after the tidyverse, so plyr's rename() and count() mask the dplyr versions
library(varhandle)
library(reshape)
library(factoextra)
getwd() # Check that working directory is correct. This code uses relative paths.
[1] "F:/Google Drive (tsakal@ucsb.edu)/Book Network Paper/analysis/PCA code"
We write some functions that allow us to make a comparison table of all the different communities and subcommunities. The end result is a table that lists the top books in a community and, next to it, the top books in each of its subcommunities. Two functions help automate this.
The tables we create are not in tidy format; rather, they are formatted for humans to read.
This first function extends a dataframe to a given length by filling the extra rows with NA. We do this so we can use the bind_cols function to place tables next to each other.
extend.dataframe <- function(df, len){
# Extends a dataframe to a certain length by filling in rows with NA
# Todo: vectorize this function for speedup
num.rows = nrow(df)
for (i in 1:len){
if (i > num.rows){
df[i, ] = NA
}
}
return(df)
}
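To see why the padding is needed: bind_cols() refuses inputs with different row counts. A minimal standalone illustration (toy data, not from the paper):

```r
library(dplyr)

# Two tables of unequal length; bind_cols(a, b) would error here
a = data.frame(x = 1:3)
b = data.frame(y = 1:2)

# Pad b with an NA row, as extend.dataframe does, and the bind works
b[3, ] = NA
bind_cols(a, b)  # 3 rows, columns x and y side by side
```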
The next function is the main code. It takes a dataframe (rdf), the name of the column to order the results by (ordering), and a flag for whether we are breaking the entire network into communities or a single community into subcommunities (subcomuns).
comparison.table <- function(rdf, ordering, subcomuns= FALSE){
# rdf: A dataframe with each book, the community it is in, and other columns describing various metrics
# ordering: The column name of the metric we wish to order by
# subcomuns: Set to TRUE if we are taking a dataframe corresponding to a community and breaking it into subcommunities. (This just changes the name of the column to match.)
rdf = rdf[,c("title", "modularity_class", "eigencentrality", "Weighted.Degree", "Label")] # Reorder Columns and extract only the ones we care about.
# Decide whether to call the modularity_class column communities or subcommunities
community = "Community"
if (subcomuns == TRUE){
community = "Subcommunity"}
# Rename the columns to better reflect paper's conventions
rdf = rename(rdf,
c(modularity_class = community,
eigencentrality = "Eigencentrality",
Weighted.Degree = "Weighted Degree",
Label = "Goodreads Id",
title = "Title")
)
# Make sure column we are ordering by is numeric instead of a factor
rdf[["Eigencentrality"]] = as.numeric(as.character(rdf[["Eigencentrality"]]))
rdf[["Weighted Degree"]] = as.numeric(as.character(rdf[["Weighted Degree"]]))
# Sort first, then round, so the ordering uses the full-precision values
rdf = rdf[order(rdf[[ordering]], decreasing=TRUE),]
rdf[["Eigencentrality"]] = round(rdf[["Eigencentrality"]], digits = 3)
rdf[["Weighted Degree"]] = round(rdf[["Weighted Degree"]], digits = 2)
# Get the number of books and communities
c = count(rdf, community)
num.communities = nrow(c)
num.books = nrow(rdf)
dflist = list()
dflist = append(dflist, rdf)
for (i in 0:(num.communities-1)){
df = rdf[rdf[community] == i,]
df = df[order(df[[ordering]], decreasing=TRUE),]
df = extend.dataframe(df, num.books) # Extend dataframe so that can use bind_cols
df = df[,c("Title", "Eigencentrality", "Weighted Degree")] # Drop the redundant community column
dflist = append(dflist, df)
}
df.full = bind_cols(dflist)
return(df.full)
}
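A subtlety in the rename() call above: because plyr is loaded after the tidyverse, plyr::rename() masks dplyr::rename(), and the two take different arguments. A minimal sketch of the plyr form this function relies on:

```r
library(plyr)

# plyr::rename takes a named vector of old = new pairs
# (the dplyr equivalent would instead be rename(d, Community = modularity_class))
d = data.frame(modularity_class = c(0, 1))
d = rename(d, c(modularity_class = "Community"))
names(d)  # "Community"
```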
First, let's make a table that takes the reader and enjoyment networks and breaks each of them into a table of their communities.
# For the reader and enjoyment network make two tables each, one sorted by eigencentrality and the other by weighted degree.
for (network.type in list("enjoyment", "reader")){
for (ordering in list("Eigencentrality", "Weighted Degree")){
print(str_glue("Creating comparison table for {network.type}, ordered by {ordering}."))
# Create table
df = read.delim2(str_glue("./raw data/full {network.type} network.csv"), sep = ",") # Read the correct table here
df = comparison.table(df, ordering)
# Save table in the processed data folder
path = str_glue("./processed data/comparison_table_{network.type}_by_{ordering}.xlsx")
write.xlsx(df, path)
print(df)
}
}
Creating comparison table for enjoyment, ordered by Eigencentrality.
Creating comparison table for enjoyment, ordered by Weighted Degree.
Creating comparison table for reader, ordered by Eigencentrality.
Creating comparison table for reader, ordered by Weighted Degree.
Next we will do the same thing for each of the subcommunities. To make the raw data for these we manually exported each table from Gephi by taking the entire network, filtering out those books not in the focal modularity class, and recalculating the Gephi statistics we care about (modularity class, eigencentrality, and weighted degree).
# Create the same tables, but now for subcommunities of each community.
for (network.type in list("enjoyment", "reader")){
for (ordering in list("Eigencentrality", "Weighted Degree")){
path = str_glue("./raw data/{network.type} subcommunities/") # Folder with subcommunity data
file_names = list.files(path, pattern="\\.csv$") # List all .csv files in this folder
# Create the table for each file
for (i in 1:length(file_names)){
f = file_names[[i]]
f = paste(path, f, sep = "")
print(f)
df = read.delim2(f, sep = ",")
df.full = comparison.table(df, ordering, subcomuns = TRUE)
write.xlsx(df.full, str_glue("./processed data/{network.type} subcommunities/comp_table_{network.type}_{i-1}_by_{ordering}.xlsx"))
}
}
}
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 0.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 1.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 2.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 3.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 4.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 5.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 6.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 0.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 1.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 2.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 3.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 4.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 5.csv"
[1] "./raw data/enjoyment subcommunities/Enjoyment Community 6.csv"
[1] "./raw data/reader subcommunities/Reader Community 0.csv"
[1] "./raw data/reader subcommunities/Reader Community 1.csv"
[1] "./raw data/reader subcommunities/Reader Community 2.csv"
[1] "./raw data/reader subcommunities/Reader Community 3.csv"
[1] "./raw data/reader subcommunities/Reader Community 4.csv"
[1] "./raw data/reader subcommunities/Reader Community 5.csv"
[1] "./raw data/reader subcommunities/Reader Community 0.csv"
[1] "./raw data/reader subcommunities/Reader Community 1.csv"
[1] "./raw data/reader subcommunities/Reader Community 2.csv"
[1] "./raw data/reader subcommunities/Reader Community 3.csv"
[1] "./raw data/reader subcommunities/Reader Community 4.csv"
[1] "./raw data/reader subcommunities/Reader Community 5.csv"
There are a few last tables to create. The reader network breaks into five communities and the enjoyment network breaks into seven. It is quite possible that if we force the reader network into seven communities, they will match the communities of the enjoyment network, and vice versa if we break the enjoyment network into only five.
We can change the number of communities by going back to the Gephi files for a network and recalculating the modularity, tweaking the resolution until we get the desired number of communities.
For this check we skip restricting the network to each subcommunity before calculating eigencentrality and weighted degree: that step at most changes the ordering slightly, and we don't need it to check whether the communities are roughly the same.
# Again, make one table sorted by eigencentrality and one by weighted degree.
for (ordering in list("Eigencentrality", "Weighted Degree")){
print(str_glue("Creating comparison table for enjoyment, ordered by {ordering}."))
# Create table
df = read.delim2("./raw data/full enjoyment network 6 communities.csv", sep = ",") # The recomputed enjoyment network data.
df = comparison.table(df, ordering)
# Save table in the processed data folder
path = str_glue("./processed data/comparison_table_enjoyment_6_comns_by_{ordering}.xlsx")
write.xlsx(df, path)
print(df)
}
Creating comparison table for enjoyment, ordered by Eigencentrality.
Creating comparison table for enjoyment, ordered by Weighted Degree.
Now we ask whether a list of subjects for each book can predict the community it is in.
sdf = read.delim2("./raw data/subjects_reduced.csv", sep = ",", colClasses = "numeric", na.strings = "") # read in dataframe of subjects
print(sdf)
# Rename the columns to better reflect paper's conventions
sdf = rename(sdf, c(X = "Goodreads Id") )
print(sdf)
tally(sdf)
# For the reader and enjoyment network
for (network.type in list("enjoyment", "reader")){
print(str_glue("Creating subject table for {network.type}."))
# Read book data table and rename columns
df = read.delim2(str_glue("./raw data/full {network.type} network.csv"), sep = ",") # Read the correct table here
df = rename(df,
c(modularity_class = "Community",
eigencentrality = "Eigencentrality",
Weighted.Degree = "Weighted Degree",
Label = "Goodreads Id",
title = "Title")
)
df = inner_join(sdf, df, by=c("Goodreads Id"))
#
# Save table in the processed data folder
#path = str_glue("./processed data/comparison_table_{network.type}_by_{ordering}.xlsx")
#write.xlsx(df, path)
#df = lapply(df, function(x) as.numeric(as.character(x)))
#df = do.call(cbind, df)
# Get the sums for all the subject columns.
non.sub.cols = c("Goodreads Id", "Title", "subjects", "Weighted Degree", "Eigencentrality", "clustering", "pageranks", "triangles", "Id", "Degree", "Community", "bipartite")
subject.sums = colSums(df[,!names(df) %in% non.sub.cols], na.rm = TRUE)
# colSums returns a named numeric vector, so sort it directly rather than indexing a column
subject.sums = sort(subject.sums, decreasing = TRUE)
}
Creating subject table for enjoyment.
Creating subject table for reader.
We can't use the pure subject counts because, without normalizing the data in some way, the most popular subjects are the most influential and dominate the entire analysis.
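One way to do that normalization (sketched here on invented numbers; the column names count_0 and count_1 just mirror the conventions above) is to divide each subject's per-community counts by that subject's total, so a subject's row gives the share of its books falling in each community:

```r
# Invented counts: "Fiction" is far more common than "Manga" overall
counts = data.frame(count_0 = c(900, 5),
                    count_1 = c(100, 45),
                    row.names = c("Fiction", "Manga"))

# Divide each row by its total; every row now sums to 1, so rare and
# popular subjects carry comparable weight in the downstream PCA
relative = counts / rowSums(counts)
relative
```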
library(readxl)
#df = read.delim2(str_glue("./raw data/relative_subject_counts_reader.csv"), sep = ",") # Read the correct table here
df = read_excel("./raw data/relative_subject_counts_enjoyment.xlsx") # Read the correct table here
New names:
* `` -> ...1
# Both the above tables give the same pca analysis
df = df %>% column_to_rownames(., var = "...1") # Set label to row names. Var should equal the name of the first column.
num.communities = ncol(df) - 1 # Minus 1 for the all-counts column
# Drop the all group because this makes all results the same
df = df[ , -which(names(df) %in% c("count_all"))]
for (i in 0:(num.communities-1)){
colname = str_glue("count_{i}")
df[[colname]] = as.numeric(as.character(df[[colname]]))
}
# Rename the columns for interpretation
# Names for enjoyment
names(df)[1:num.communities] = c("Thriller", "Fantasy/Scifi", "Manga", "(Modern) Classics", "Children's", "Contemporary/Realistic", "Young Adult")
# Names for reader
#names(df)[1:num.communities] = c("Children's", "(Modern) Classics", "Genre Fiction", "Young Adult", "Contemporary/Realistic")
# Remove fiction row
#row.names.remove <- c("Fiction")
#df = df[!(row.names(df) %in% row.names.remove), ]
df["counts_all"] = NULL
subjects.pca = prcomp(df,center = TRUE, scale. = TRUE)
summary(subjects.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.3114 1.1183 1.0927 1.0221 0.9682 0.9207 0.07471
Proportion of Variance 0.2457 0.1787 0.1706 0.1492 0.1339 0.1211 0.00080
Cumulative Proportion 0.2457 0.4244 0.5949 0.7442 0.8781 0.9992 1.00000
fviz_eig(subjects.pca)
We see that all components matter except the last. This makes sense because each eigenvector should correspond to a community. If a PCA explained 100% of the variance with as many components as there are communities, the communities would have no overlap; how far we fall short of this reflects the mixing between genres.
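The proportions summary() reports come straight from the component standard deviations; a standalone sketch on random data (seven columns to mirror the seven communities):

```r
set.seed(1)
m = matrix(rnorm(10 * 7), ncol = 7)       # toy data with 7 variables
p = prcomp(m, center = TRUE, scale. = TRUE)

# Variance explained by each component, as in the "Proportion of Variance" row
prop = p$sdev^2 / sum(p$sdev^2)
cumsum(prop)                              # cumulative proportion, ends at 1
```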
coords = get_pca_ind(subjects.pca)$coord
coords.df = as.data.frame(coords)
get_pca(subjects.pca)$cor
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
Thriller 0.01628752 0.81721542 0.20812923 0.5325549 -0.06396264 -0.003437558 -0.029293099
Fantasy/Scifi 0.50922021 -0.04524416 0.52816728 -0.3382731 0.35775027 0.465452699 -0.025037911
Manga 0.19226999 -0.01129886 0.46047943 -0.3212773 -0.78600667 -0.172609079 -0.006619265
(Modern) Classics -0.84022900 -0.29614787 0.26417188 -0.1000065 0.18157684 -0.303043346 -0.041450941
Children's 0.14479598 -0.65192742 -0.09012555 0.6013011 -0.28274165 0.322349708 -0.022066254
Contemporary/Realistic -0.39605095 0.25442978 -0.58655752 -0.3912910 -0.26062305 0.461274899 -0.023490877
Young Adult 0.73450310 -0.05686448 -0.48763811 -0.1371158 0.08206854 -0.438918190 -0.035993250
fviz_pca_biplot(subjects.pca,
col.var = "contrib", # Color by contributions to the PC
#col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
#addEllipses = TRUE,
label = c("quanti.sup", "var"),
alpha.ind = .1,
alpha.var = 1,
)
fviz_pca_var(subjects.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
library("corrplot")
corrplot 0.84 loaded
var <- get_pca_var(subjects.pca)
corrplot(var$contrib, is.corr=FALSE, method="square")
corrplot(var$cor, is.corr=FALSE, method="square", tl.col = 1)